ABSTRACT

Objectives: Data linkage combines information from 
several clinical data sets. The authors examined 
whether coding inconsistencies for cardiovascular 
disease between components of linked data sets result 
in differences in apparent population characteristics. 
Design: Retrospective cohort study. 
Setting: Routine primary care data from 40 Scottish 
general practitioner (GP) surgeries linked to national 
hospital records. 

Participants: 240 846 patients, aged 20 years or
older, registered at a GP surgery. 
Outcomes: Cases of myocardial infarction, ischaemic 
heart disease and stroke (cerebrovascular disease) 
were identified from GP and hospital records. Patient 
characteristics and incidence rates were assessed for 
all three clinical outcomes, based on GP, hospital, 
paired GP/hospital (similar diagnoses recorded 
simultaneously in both data sets) or pooled GP/ 
hospital records (diagnosis recorded in either or both 
data sets). 

Results: For all three outcomes, the authors found 
evidence (p<0.05) of different characteristics when 
using different methods of case identification. 
Prescribing of cardiovascular medicines for ischaemic 
heart disease was greatest for cases identified using 
paired records (p<0.013). For all conditions, 30-day 
case fatality rates were higher for cases identified 
using hospital compared with GP or paired data, most 
noticeably for myocardial infarction (hospital 20%, GP 
4%, p=0.001). Incidence rates were highest using 
pooled GP/hospital data and lowest using paired data. 
Conclusions: Differences exist in patient 
characteristics and disease incidence for 
cardiovascular conditions, depending on the data 
source. This has implications for studies using routine 
clinical data. 
Article focus

■ Data linkage allows information to be combined
from different routine clinical data sources.

■ Previous work has shown differences between
sources of data but has not examined this at the
patient level.

Key messages

■ Patients' apparent characteristics, and disease
incidence and severity, vary depending on
whether primary care, hospital or combined
definitions of cardiovascular events are used.

■ Use of isolated routine primary care or hospital
data may result in biased patient selection.

■ This has implications in the public health arena,
clinical trial patient recruitment and validity and
reliability of secondary data in clinical trials.

Strengths and limitations of this study

■ The strengths of this study are the novel
analytical approach, using a large routine data
set linked at individual patient level from multiple
GP surgeries.

■ Limitations of this study include restricting our
analysis to four coding groups, uncertainty as to
whether GP and hospital events could be
considered to be recorded simultaneously,
potential diagnostic coding inaccuracies and the
relatively small number of GP surgeries, which
may not have been representative.

BACKGROUND

Primary care data sets are commonly used for
assessment of cardiovascular outcomes. Such
events often are associated with hospitalisation. 1
However, it is possible that the manner in which
outcomes are coded and recorded in electronic
health records may differ between primary and
secondary care. This may result not only in
differences in the apparent incidence of a condition,
depending on whether primary or secondary
care records are used, but also in differences
in the observed characteristics of patients.
Studies have observed that variations in
diagnostic criteria can affect estimates of
disease prevalence, 2 and the complexities of
clinical coding systems for electronic healthcare
records can lead to inconsistent data
recording. 3 This will lead to uncertainties
with respect to disease prevalence and mortality, impact
on clinical care, have additional health service
implications such as affecting funding 5 and potentially
influence identification of patients for clinical trials. Previous
studies have compared general practice coding and
disease prevalence with other unlinked data sources,
including paper notes. 6 7 However, the effect of
combining information from two sources has not been
previously examined. This study used linked individual
patient electronic health records collected from primary
and secondary care to examine the effect of using data
from different parts of the healthcare service on the
incidence rates, case fatality rates and patient
characteristics of myocardial infarction (MI), ischaemic heart
disease (IHD) and cerebrovascular disease (CVD).

METHODS 
Data sources 

Sixty general practitioner (GP) surgeries take part in the 
Scottish national Practice Team Information (PTI) 
project, of which 40 self-selected surgeries contributed to 
the data set used in this study. Practices involved in the 
PTI project provide routine central recording of clinical 
activity and morbidity from a sample of GP surgeries 
considered reasonably representative of the Scottish 
population. Practices are reimbursed to ensure that data 
recording is optimal. Clinical coding used the Read code 
system. Data are used to calculate national estimates and 
used by various organisations (eg, NHS Boards, Scottish 
Government) to inform policies and better understand 
health in Scotland. 

Patient details from the PTI data set were linked to the 
corresponding admissions recorded in Scottish national 
hospital data (the Scottish Morbidity Record, SMR-01) 
using probabilistic matching. Matching was based on 
Soundex-encoded name, date of birth, sex, postcode and 
a unique nationwide identifier, the Community Health 
Index. Experienced human review was used to set 
a threshold for linkage. A substantial proportion of 
patients in this GP cohort have no hospital admissions, 
and as such, it is difficult to know whether the absence of 
a match is either due to a genuine lack of corresponding 
hospital record or due to a false-negative error. Match 
rates are thus difficult to quantify, although the use of 
multiple identifiers should improve linkage quality. The 
linkage was carried out by the Information Services 
Division, NHS National Services Scotland. The work was 
approved by the Privacy Advisory Committee of NHS 
National Services Scotland. For the 2004-2006 period,
SMR-01 data are considered to be 88% accurate. 8 SMR-01
records are generated for all inpatient hospital
medical discharges and transfers. Coding is based on the 
International Statistical Classification of Diseases and 
Related Health Problems (ICD) system (ICD9 prior to 
2000, ICD10 thereafter), with up to six inpatient diag- 
noses per record. Accident and emergency, maternity 
and psychiatric admissions, along with outpatient atten- 
dances, are not recorded in SMR-01. SMR-01 itself is also 
routinely linked to national mortality data (General
Registrar's Office for Scotland, GROS). SMR-GROS data
are also used to generate Scottish National Statistics. 
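
As an illustration of the name-encoding step only, the sketch below implements standard American Soundex in Python. It is not the linkage software used for this study: the function name `soundex` is hypothetical, and the probabilistic weighting of date of birth, sex, postcode and Community Health Index is not shown because it is not described in this paper.

```python
def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a surname."""
    # Consonant groups share a digit; vowels, Y, H and W carry no digit.
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]                 # the first letter is kept as-is
    prev = codes.get(letters[0], "")    # but its digit still blocks repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code                     # a vowel resets prev to ""
    return (result + "000")[:4]         # pad or truncate to four characters


# Names that sound alike receive the same code, which tolerates spelling variants.
print(soundex("Robert"), soundex("Rupert"))
```

Encoding names in this way lets a linkage step treat common misspellings as agreement on the name field, at the cost of occasionally grouping unrelated surnames.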

Identification and classification of cases 

We first identified all records of MI, IHD and CVD from 
both GP and hospital data sets using the following Read 
codes (MI: G30%/35%/38%, Gyu34/35/36; IHD: G3%,
Gyu3%; CVD: G6%, Gyu6%, F4236; where % indicates
a 'wildcard' match) and ICD codes (MI: ICD10 I21-I22,
ICD9 410; IHD including MI: ICD10 I20-I25, ICD9
410-414; CVD (stroke) including haemorrhage and
transient ischaemic attack (TIA): ICD10 I60-I69,
G45-G46; ICD9 430-438). Hospital events were identified
from any of the six diagnostic positions. These were
not necessarily first events. 
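
The code lists above can be expressed as simple prefix and range checks. The following Python sketch is illustrative only and is not the SQL used in the study; the function names are hypothetical, the codes without a wildcard (Gyu34/35/36, F4236) are treated as prefixes on the assumption that they are complete five-character Read codes, and ICD codes are assumed to arrive as strings such as "I21.4" or "4349".

```python
# Read code prefixes taken from the definitions in the text ('%' = wildcard).
READ_PREFIXES = {
    "MI": ("G30", "G35", "G38", "Gyu34", "Gyu35", "Gyu36"),
    "IHD": ("G3", "Gyu3"),
    "CVD": ("G6", "Gyu6", "F4236"),
}

# ICD10 ranges as (letter, first, last) on the three-character category.
ICD10_RANGES = {
    "MI": (("I", 21, 22),),
    "IHD": (("I", 20, 25),),
    "CVD": (("I", 60, 69), ("G", 45, 46)),
}

# ICD9 ranges on the three-digit category.
ICD9_RANGES = {"MI": (410, 410), "IHD": (410, 414), "CVD": (430, 438)}


def classify_read(code: str) -> set:
    """Outcomes matched by a GP Read code (MI codes also count as IHD)."""
    return {o for o, prefixes in READ_PREFIXES.items() if code.startswith(prefixes)}


def classify_icd10(code: str) -> set:
    """Outcomes matched by an ICD10 code such as 'I21.4'."""
    code = code.strip().upper()
    if len(code) < 3 or not code[1:3].isdigit():
        return set()
    letter, number = code[0], int(code[1:3])
    return {o for o, ranges in ICD10_RANGES.items()
            if any(letter == l and lo <= number <= hi for l, lo, hi in ranges)}


def classify_icd9(code: str) -> set:
    """Outcomes matched by an ICD9 code such as '4349' or '410'."""
    code = code.strip()
    if len(code) < 3 or not code[:3].isdigit():
        return set()
    number = int(code[:3])
    return {o for o, (lo, hi) in ICD9_RANGES.items() if lo <= number <= hi}
```

Because MI is a subset of IHD in both coding systems, an MI code is returned under both outcomes, which mirrors the "IHD including MI" definition.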

We then found all episodes of a similar GP and hospital 
event type occurring within a 30-day period and made 
the assumption that these pairings represented the same 
clinical event. Where the GP and hospital dates differed 
for these paired episodes, the first of the two dates was 
taken. The choice of 30 days was a pragmatic one but 
supported by visual evaluation of the distribution of time 
gaps between similar hospital and GP event types over 
a 2-year period. Of note, an event recorded by the GP 
does not necessarily require a face-to-face consultation or 
a referral to be made; hospital admissions will usually be 
retrospectively recorded by the GP, using the admission 
date as opposed to the data-entry date. 
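
A minimal sketch of the pairing rule is given below. The text specifies only the 30-day window and the use of the earlier date, so the one-to-one, nearest-date matching used here is an assumption, as are the function and parameter names.

```python
from datetime import date, timedelta


def pair_events(gp_dates, hospital_dates, window_days=30):
    """Pair GP and hospital events of the same type recorded within the window.

    Returns one date per pair: the earlier of the GP and hospital dates.
    Each hospital record is used at most once (an assumption of this sketch).
    """
    window = timedelta(days=window_days)
    unused = sorted(hospital_dates)
    paired = []
    for gp in sorted(gp_dates):
        candidates = [h for h in unused if abs(h - gp) <= window]
        if candidates:
            nearest = min(candidates, key=lambda h: abs(h - gp))
            unused.remove(nearest)
            paired.append(min(gp, nearest))
    return paired


# Hypothetical patient: one GP record close to an admission, one isolated record each.
gp = [date(2005, 3, 10), date(2006, 1, 1)]
hospital = [date(2005, 3, 1), date(2005, 8, 1)]
print(pair_events(gp, hospital))
```

Only the March 2005 records fall within 30 days of one another, so this hypothetical patient has a single paired event, dated to the admission.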

Analysis was carried out over the period 1 January 2005 
to 1 January 2007. The total population was randomly 
allocated to one of four methods of identifying cardio- 
vascular events: those based on GP events only; those 
based on hospital events only; those based on pooled 
GP/hospital events, with an event in GP data only, 
hospital data only or both the GP and hospital data 
(although not necessarily occurring within 30 days); and 
those based on paired GP/hospital events (those 
recorded in both GP and hospital data within 30 days).
An episode was included as an incident event only if 
there was no record of a similar clinical event at any time 
prior to 1 January 2005 coded in the same data set(s). 

This method of identifying incident events is shown in 
figure 1. For example, for an event to be included using 
only GP data, the first event would have to be recorded 
by the GP during the 2-year period of interest, with no 
similar events recorded by the GP prior to 1 January 
2005; hospital data are completely ignored in this case. A 
similar approach is used for identifying events using 
hospital-only data, with GP records ignored in this situ- 
ation. For the third method, identifying events using 
pooled GP/hospital data, the first event needs to be 
recorded by either the hospital or the GP during the 
2-year study period; there must be no similar event 
recorded in either data set prior to 1 January 2005. For 
the final method, the first occurrence of paired (ie, 
within 30 days) records in both GP and hospital data sets 
constituted an incident event if it occurred during the 
2-year period; any unpaired GP or hospital records 
occurring prior to 1 January 2005 were ignored. 
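
The four rules can be summarised in a short Python sketch. It is an illustration under stated assumptions rather than the study's own code: the function names are hypothetical, the study period is treated as half-open (1 January 2005 up to but not including 1 January 2007), and pairing is simplified to any GP and hospital record within 30 days of each other.

```python
from datetime import date, timedelta

STUDY_START = date(2005, 1, 1)
STUDY_END = date(2007, 1, 1)  # treated as exclusive in this sketch


def paired_dates(gp, hospital, window_days=30):
    """Earlier date of every GP/hospital record pair within the window."""
    window = timedelta(days=window_days)
    return [min(g, h) for g in gp for h in hospital if abs(g - h) <= window]


def incident_event(gp, hospital, method):
    """Date of the incident event under one coding method, or None.

    An event is incident when the first qualifying record in the relevant
    data set(s) falls inside the study period, so any earlier qualifying
    record rules the patient out.
    """
    if method == "gp":
        candidates = list(gp)                    # hospital data ignored
    elif method == "hospital":
        candidates = list(hospital)              # GP data ignored
    elif method == "pooled":
        candidates = list(gp) + list(hospital)   # either data set
    elif method == "paired":
        candidates = paired_dates(gp, hospital)  # unpaired records ignored
    else:
        raise ValueError(f"unknown method: {method}")
    if not candidates:
        return None
    first = min(candidates)
    return first if STUDY_START <= first < STUDY_END else None


# Hypothetical patient: an old GP code, then a GP code and an admission in 2005.
gp = [date(2003, 5, 1), date(2005, 6, 1)]
hospital = [date(2005, 6, 10)]
for method in ("gp", "hospital", "pooled", "paired"):
    print(method, incident_event(gp, hospital, method))
```

For this hypothetical patient the 2003 GP code excludes the GP-only and pooled definitions, whereas the hospital-only and paired definitions both yield an incident event in June 2005, showing how one patient can be classified differently by each method.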
Figure 1 Identification of incident events. The figure shows how incident events can be identified from linked general
practice (GP) and hospital data sets, for eight hypothetical patients, illustrating some of the potential coding combinations.
Circles correspond to the presence of a GP (O) or hospital (•) clinical code, with numbers illustrating the order. Immediately
adjacent circles represent codes occurring within 30 days of one another. It can be seen that, for any given patient, it is possible to
classify them as having an incident event in up to four ways: GP data only, hospital data only, paired GP/hospital and pooled GP/
hospital; the code that identifies an incident event for each of these methods is shown on the right of the figure. Codes do not count
as incident events if a further, similarly classified, event has occurred prior to the start of the study period. In our study, patients
were randomly allocated to one of the four coding methods. For instance, if patient E was allocated to 'hospital only' coding, they
would not be classified as having had an event; in contrast, they would be classified as having had an event if they were allocated
to any of the other three coding methods.
[Figure 1 image not reproduced; its time axis is labelled 1 January 2005 and 1 January 2007.]

For each incident event, we determined the patient's 
age, sex, socioeconomic status (Scottish Index of 
Multiple Deprivation quintile), 9 recorded current 
smoking status, record of hypertension, record of dia- 
betes and Charlson Index. 10 Comorbidities, including 
Charlson Index, were determined from the GP data as 
the presence of any relevant diagnostic Read code prior 
to the incident episode date; the list of codes used is 
available from the authors on request. Although we have 
not formally evaluated performance of our Charlson 
Index Read code list, we match 87% of those events 
identified by the method described by Khan et al, 11 and 
as such believe that this represents a reasonable, albeit 
pragmatic, measure of comorbidity. Death from any 
cause within 30 days of the event was ascertained from 
linked national mortality (GROS) data. Drug therapy 
recorded in the GP record, starting prior to or within 
30 days after the event, and continuing for any period of 
time after the event, was ascertained for patients alive at 
30 days. Drug classes included were ACE inhibitors 
(including angiotensin receptor blockers), β-blockers,
calcium channel blockers, diuretics (including potas- 
sium sparing and combination diuretics), nitrates, 
statins and antiplatelet agents (aspirin or clopidogrel for 
MI or IHD; aspirin or dipyridamole for CVD). 
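
The two 30-day definitions above can be sketched as follows. This is an illustration only: the function names and the representation of a prescription as a start date and an optional stop date are assumptions, not the structure of the PTI data.

```python
from datetime import date, timedelta


def died_within_30_days(event_date, death_date):
    """True if death from any cause occurred within 30 days of the event."""
    if death_date is None:
        return False
    return 0 <= (death_date - event_date).days <= 30


def on_treatment(event_date, rx_start, rx_stop=None):
    """True if a drug started before or within 30 days after the event
    and continued for some time after it (rx_stop None = still ongoing)."""
    started_in_time = rx_start <= event_date + timedelta(days=30)
    continued_after = rx_stop is None or rx_stop > event_date
    return started_in_time and continued_after


def case_fatality_rate(events):
    """30-day case fatality (%) for a list of (event_date, death_date) pairs."""
    deaths = sum(died_within_30_days(e, d) for e, d in events)
    return 100.0 * deaths / len(events)


# Hypothetical cohort of four events, one of which is fatal within 30 days.
cohort = [
    (date(2005, 3, 1), date(2005, 3, 20)),
    (date(2005, 4, 1), None),
    (date(2005, 5, 1), date(2006, 1, 1)),
    (date(2006, 2, 1), None),
]
print(case_fatality_rate(cohort))
```

Prescribing would then be assessed with `on_treatment` only for patients for whom `died_within_30_days` is false, matching the restriction to patients alive at 30 days.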

Statistical analysis 

Incidence rates were calculated excluding patients with 
events in the relevant data set(s) prior to 1 January 2005. 
Incidence rates are expressed per 100 000 patient-years 
(based on total number of days of follow-up for each 
patient within each respective group). Statistical differ- 
ences in patient characteristics (including drug treat- 
ment) between coding categories were evaluated using 
χ² tests (for proportions) and Kruskal-Wallis non-parametric
analysis of variance (for continuous data).
The association between coding and 30-day case fatality 
was assessed by logistic regression, including the 
covariates age, sex, deprivation, smoking status, 
hypertension, diabetes and Charlson Index. Differences
in the four incidence rates obtained were examined using
Poisson regression.
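
The incidence rate calculation can be written out explicitly. The sketch below is illustrative: the function name is hypothetical, and converting days to years by dividing by 365.25 is an assumption, since the conversion factor is not stated.

```python
def incidence_rate(n_events, follow_up_days, per=100_000, days_per_year=365.25):
    """Events per `per` patient-years, from each patient's days of follow-up."""
    person_years = sum(follow_up_days) / days_per_year
    if person_years <= 0:
        raise ValueError("total follow-up must be positive")
    return n_events * per / person_years


# Hypothetical group: 2000 patients each followed for one year, with 2 events.
print(incidence_rate(2, [365.25] * 2000))
```

In the study, patients with an event in the relevant data set(s) before 1 January 2005 would be removed before the follow-up days are summed.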

Data management was carried out using Microsoft 
SQL Server 2000. Statistical analysis was performed using 
SPSS V.17 (SPSS Inc.). 

RESULTS 

Differences in identification of incident events

There were a total of 240 846 patients, evenly distributed 
between the four coding groups. Numbers of incident 
events are shown in table 1. Incidence rates for the three
conditions are shown in figure 2. There was strong
evidence (p<0.001, Poisson regression) that the incidence
rates for all three clinical conditions depend on
which data set(s) are used to identify cases. In all cases,
the pooled GP/hospital data produced the highest 
incidence rates (376, 1089 and 767 per 100 000 patient- 
years for MI, IHD and CVD, respectively), and the paired 
GP/hospital data gave the lowest incidence rates (188, 
489 and 272 per 100 000 patient-years, respectively). 
There was no evidence that the incidence rates based on
only GP data differed from those based on hospital data for
either MI (p=0.14) or CVD (p=0.27), but there was
strong evidence that hospital-based rates were higher for
IHD (975 and 673 events per 100 000 patient-years for
hospital and GP, respectively, p<0.001). The pooled GP/hospital data
produced slightly higher incidence rates than hospital 
data alone for CVD (p<0.001) and marginally so for MI 
(p=0.048) and IHD (p=0.066). 
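
To make the size of these differences concrete, the reported rates can be turned into simple ratios. The short calculation below uses only the figures quoted in this section; the variable names are ours, and the ratios are descriptive arithmetic, not the Poisson regression estimates.

```python
# Reported incidence rates per 100 000 patient-years: (pooled, paired).
reported = {"MI": (376, 188), "IHD": (1089, 489), "CVD": (767, 272)}

# How many times higher the pooled rate is than the paired rate.
pooled_vs_paired = {
    condition: round(pooled / paired, 2)
    for condition, (pooled, paired) in reported.items()
}

# Hospital-based versus GP-based rate for IHD (975 vs 673).
ihd_hospital_vs_gp = round(975 / 673, 2)

print(pooled_vs_paired, ihd_hospital_vs_gp)
```

On these figures the pooled definition gives roughly two to three times the incidence of the paired definition, with the widest gap for CVD.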

Patient characteristics 

Patient characteristics are shown in table 1 for all three 
clinical conditions. There was no evidence that rates of 
diabetes and hypertension, or the distribution of sex or 
deprivation, varied between coding groups. Greater 
numbers of smokers were found in the paired GP/
hospital group for patients with MI (45% in the paired
group compared with 28%-34% in the other groups,
p=0.028) and IHD (35% compared with 24%-27%,
p=0.021). The level of comorbidity for all conditions, as
measured by the Charlson Index, was lower in the paired
GP/hospital group (1.8, 1.3 and 1.9 for MI, IHD and
CVD, respectively) and higher in the hospital group (2.2,
1.7 and 2.4, respectively, p<0.014). For IHD and CVD,
there was evidence that patients identified using solely GP
or solely hospital data were slightly younger.

Table 1 Variation of patient characteristics with different methods of identifying cases
Patient characteristics for myocardial infarction, ischaemic heart disease and cerebrovascular disease, identified using GP, hospital, paired GP/
hospital and pooled GP/hospital data. Deprivation quintile 1 is least deprived. Significant differences are calculated by χ² test or Kruskal-Wallis
analysis of variance.
GP, general practitioner.

Prescribing 

Differences in prescribing rates were observed between 
coding groups (table 2). These were most marked for 
IHD, where rates of prescribing of ACE inhibitors,
β-blockers, nitrates, statins and antiplatelet agents were
higher in the paired group (p<0.013). However, this
finding did not appear to be replicated for MI specifically. 
For CVD, prescribing rates for statins and antiplatelet 
agents were lower in the hospital group (p<0.022). 

Case fatality

Considerable 30-day case fatality rate differences exist for
all three conditions depending on the coding used
(p<0.002, table 3). Rates for all conditions are highest in
patients coded only in hospital and lower in the GP and
paired GP/hospital groups. The most striking differences
were observed for MI, with a 30-day case fatality rate of
20% for the hospital group but only 4% for the GP group.

Figure 2 Incidence rates, expressed per 100 000 patient-years,
for different clinical conditions over a 2-year time period
beginning 1 January 2005, based on general practice (GP),
hospital, paired GP/hospital and pooled GP/hospital data.
CVD, cerebrovascular disease; IHD, ischaemic heart disease;
MI, myocardial infarction.

DISCUSSION 

In a world where electronic healthcare data are 
becoming increasingly used for the purposes of clinical
trials and epidemiological research, there is a need for
researchers to understand whether additional informa- 
tion can be gained by linking two (or indeed more) 
electronic health record data sources together. However, 
where there is overlap between the constituent data sets, 
such as with coding of clinical conditions, the researcher 
needs to decide which data set to rely on for identifying 
cases, or indeed whether combining information from 
both the data sets may be of value. Our study demon- 
strates that the method of coding MI, IHD and CVD 
appears to result in identification of different types of 
patient, in particular as characterised by prescribing and 
case fatality rates. Incidence rates of disease also vary
depending on the coding method used. 

Previous work examining the epidemiology of cardio- 
vascular disease has been conducted in Scotland using 
routine clinical data. Primary care data have been used 
to demonstrate that IHD is a common problem associ- 
ated with male gender, increasing age and socioeco- 
nomic deprivation. 12 Yet the recording of IHD data 
varies in general practice with different methods used 
for case detection. 13 Furthermore, external factors such 
as payment-for-performance have been shown to 
improve the recording of IHD-related health
indicators. Such incentivisation was introduced to UK
general practice (but not hospital practice) in 2004, and
so it is possible that this may have reduced the
discrepancies between hospital and GP data in our study.
Interestingly, pooling of GP and SMR records has
previously been advocated for detecting MI cases, 15 and
pooled GP/SMR data from the same data set we used
have demonstrated differences between cohorts of
incident and prevalent MI. 16 However, the effect of using
only one component of such a data set has been hitherto
unknown.

Table 2 Variation of patient characteristics with different methods of identifying cases

                              GP         Hospital   Paired       Pooled       p Value
                                                    GP/hospital  GP/hospital
Myocardial infarction
  N                           139        137        99           173
  ACE inhibitor/ARB (%)       68         77         77           71           0.30
  β-blocker (%)               68         61         59           61           0.50
  Calcium channel blocker (%) 10         10         [missing]    [missing]    [missing]
[The remaining myocardial infarction rows and the ischaemic heart disease rows are incomplete in the source and their column alignment cannot be recovered. Surviving entries, in source order: [gap] 0.29, [gap] 32, 32, [gap] 29, 0.87, [gap] 46, 61, 59, [gap] Antiplatelet agent (%), [gap] 0.43, [gap] 33, [gap] 40, 43, 60, 40, <0.001, [gap] 67, [gap] <0.001, Antiplatelet agent (%), [gap] 71, 87, 66, <0.001.]
Cerebrovascular disease
[The first cerebrovascular disease rows are incomplete in the source. Surviving entries, in source order: [gap] 38, 33, 31, [gap] 0.42.]
  β-blocker (%)               25         19         22           19           0.16
  Calcium channel blocker (%) 20         15         13           17           0.27
  Diuretic (%)                32         33         32           33           0.99
  Nitrate (%)                 15         14         15           13           0.94
  Statin (%)                  56         41         53           50           0.006
  Antiplatelet agent (%)      54         44         50           55           0.022

The 30-day prescribing rates for myocardial infarction, ischaemic heart disease and cerebrovascular disease, identified using GP, hospital,
paired GP/hospital and pooled GP/hospital data. Patients are those alive at 30 days, and this is reflected by lower numbers of patients than in
tables 1 and 3. Significant differences are calculated by χ² test.
ARB, angiotensin receptor blocker; GP, general practitioner.

Table 3 Variation of case fatality rates with different methods of identifying cases

                                 GP         Hospital   Paired       Pooled       p Value
                                                       GP/hospital  GP/hospital
Myocardial infarction
  N                              [missing]  [missing]  105          209
  30-day case fatality rate (%)  [missing]  20         [missing]    [missing]    [missing]
Ischaemic heart disease
  N                              362        529        270          585
  30-day case fatality rate (%)  2          [missing]  [missing]    [missing]    0.002
Cerebrovascular disease
  N                              302        330        153          424
  30-day case fatality rate (%)  6          16         5            10           0.001

The 30-day case fatality rates for myocardial infarction, ischaemic heart disease and cerebrovascular disease, identified using GP, hospital,
paired GP/hospital and pooled GP/hospital data. The significance of the differences between coding methods is adjusted for confounding
factors using logistic regression (see text for details).
GP, general practitioner.

Reasons for differences in incidence rates and patient 
characteristics 

Our data do not allow us to determine the exact cause of 
our findings, but a number of hypotheses may be 
proposed. Incident disease is reassuringly similar 
between GP and hospital groups for MI and CVD. The 
lower incidence of IHD for the GP group reflects the fact 
that many patients will have had relatively stable coro- 
nary disease for a number of years but not necessarily 
required acute hospital admission. Thus, many GP
episodes of IHD do not count as true incident cases
because the patients have had prior contact with the GP,
whereas a higher number of hospital episodes are incident
cases because these patients have never previously been admitted.
The lower incidence rates for the paired GP/hospital 
group, and higher incidence rates for the pooled GP/ 
hospital group, are inevitable consequences of the way in 
which the two data sets are united, although the 
magnitude of these differences will nonetheless reflect 
the degree of inconsistency in coding between the two. 
Furthermore, it would appear that because the paired 
GP/hospital data considerably underestimate the true 
disease incidence, it is probably not a useful method for 
identifying cases, even though such cases might be more 
rigorously identified. In addition, the increase in inci- 
dence rate using the pooled GP/hospital data demon- 
strates the potential advantage of combining two data 
sets, over use of a single data set, from the perspective of 
improving case finding. 



The discrepancies in death rates are probably relatively 
straightforward to explain. Acute MI admission has 
a high case fatality, 1 but those surviving beyond discharge 
have a much lower case fatality subsequently. It seems 
likely that the GP may fail to record the cause of death in 
patients who do not survive the hospital admission, thus 
resulting in the lower case fatality rates observed in the 
paired GP/hospital coding group. Furthermore, it is 
possible that patients coded only by the GP may repre- 
sent 'less serious' illness, where hospitalisation is not 
deemed necessary by the GP. It is recognised that many 
patients suffering relatively minor strokes may not be 
admitted to hospital, 17 resulting in lower case fatality for 
CVD in the GP group, although with the growing avail- 
ability of active treatment options for ischaemic stroke in 
the form of thrombolysis, this may well change. We used 
national mortality data to identify deaths from both GP 
and SMR data sets, so discrepancies in recording of 
death between GP and hospital are unlikely to explain 
the differences in case fatality rates observed. Further- 
more, the majority of paired events share exactly the 
same date, suggesting that retrospective date entry by the 
GP of the hospital event is common, and thus, there is no 
reason why this could not be carried out for fatal events. 

The higher prescribing rates for IHD in the paired 
coding group are probably due to GPs responding 
appropriately to secondary care instigated intervention, 
reflected in appropriate treatment. That such differ- 
ences were not observed for MI may be due to better 
communication and awareness for this specific condition 
compared with other IHD, such as angina, meaning that 
prescribing in the hospital group appears just as good as 
for the paired GP/hospital group. However, fewer MI 
events may have left us underpowered to detect differ- 
ences. The lack of difference in the GP and paired 
groups for CVD may reflect poorer awareness of stroke 
management guidelines 18 in comparison with coronary 
heart disease, and so prescribing rates are consequently 
no higher in the paired group. The lower prescribing 
rates of statins and antiplatelet agents in the CVD 
hospital group may reflect the GP being unaware of 
these patients' clinical need resulting in undertreatment;
this is supported by the higher prescribing rates in
the paired group. The differences in other patient 
characteristics — specifically smoking and comorbidity — 
are less easy to understand but may represent increased 
disease severity and mortality in hospitalised smokers 
and multimorbid patients. The small differences in age
(<3 years) seem unlikely to be clinically relevant,
although they may be pertinent from the public health
perspective. Finally, miscoding of diagnoses
may explain some of the above differences; for
instance, heart failure may be used as an alternative but 
incorrect code for MI. 19 Furthermore, the introduction 
of sensitive troponin assays has influenced MI detection 
rates 20; it is possible that lack of familiarity among some
clinicians with the resulting terms (eg, non-ST elevation
MI, acute coronary syndrome) may result in inaccurate 
diagnoses being recorded. 

Limitations 

This study has highlighted important issues related to 
patient coding and linked data, but although it has the 
advantage of using a reasonably large routine data set, 
linked at the individual patient level, a number of issues 
and limitations should be considered. The relatively 
small number of GP surgeries (40) may not have been 
fully representative. In addition, the number of events is 
relatively small, and given the conservative nature of the 
χ² test, this increases the possibility of type 2 errors; thus,
a larger data set may have identified more differences 
between groups. We restricted our analysis to four simple 
coding groups — GP, hospital, paired and pooled GP/ 
hospital. However, it is clear that there are many further 
ways of categorising events, including the presence or 
absence of prior or subsequent coding based on the 
alternative half of the data set. For instance, an incident 
GP event with a historical hospital event may be coded 
differently to a GP event with no previous hospital 
record. However, we found that many of these theoret- 
ical categories have only a handful of cases. Further- 
more, even when we examined six or seven separate 
smaller coding categories, similar differences in patient 
characteristics persisted between groups (data not 
shown). Our choice of four main groups was therefore 
a pragmatic one, which reflects the choice that would 
face a researcher dealing with a similar linked data set. 
The decision to use a 30-day limit for pairing data could 
also be questioned; we are unable to prove that these two 
events are truly the same clinical episode. The choice was 
again, therefore, partly pragmatic, although supported 
by examination of the distribution of time gaps between 
the GP and hospital data. We did not limit the lead-in 
time period prior to 1 January 2005 in any way. Length of 
GP records is generally greater and more variable than 
SMR records, and there is the potential to see a lower 
number of new incident events among persons with 
longer GP records. Our study used routine GP data, and 
it is possible that such profound differences may not be 
found with research-standard databases, such as the General
Practice Research Database (GPRD). 21 Nonetheless,
work linking primary care research databases to hospital 
(and other) records is ongoing, and the issues raised by 
our study must be acknowledged. The SMR data set only 
records hospital events in Scotland and thus fails to 
capture events elsewhere in the UK or abroad. Similar
issues face the English equivalent Hospital Episode 
Statistics, and a UK-wide hospital events data set would 
be valuable. SMR (and Hospital Episode Statistics) also 
provide multiple diagnostic codes for a single event. We 
elected to use all six diagnostic positions to ensure 
maximum capture of relevant hospital events. However, 
the robustness of low-priority diagnoses might be ques- 
tioned. Nonetheless, we found similar results when we 
used only two diagnostic positions (data not shown). We 
also did not examine miscoding of events — for example, 
a code of angina being used rather than the code for MI. 
Coding of SMR is considered 99% complete and 88%
accurate 8; corresponding metrics are not available for
PTI data (although the completeness and accuracy of
Read coding of morbidity in Scottish general practice
have been shown previously to be greater than 91% 22).
Furthermore, the two data sets use different coding 
systems, so completely reliable comparison is not 
possible. However, we used relatively broad definitions, 
and the Read code system is based on ICD. Nonetheless, 
we may in particular have missed some administrative 
Read codes, which might have enabled identification of 
additional cases in the GP group. Of course, ideally 
further validation of the coding should be conducted; 
linkage to laboratory data might be one way of achieving 
this. Finally, our 30-day limit for prescribing was selected 
from a pragmatic perspective. However, it is possible that 
patients who were admitted for over 30 days would not 
have had a new prescription issued by the GP within the 
30-day post-event period, resulting in an apparent 
underestimation of prescribing. We believe that these 
numbers will be relatively small, however, and unlikely to 
alter the overall interpretation of our findings. 

Research and policy implications 

These results have significant implications for linked 
data; the drug management, disease severity and to some 
degree the patient characteristics vary depending on 
how the disease cohort is defined. They also have 
implications for the use of unlinked routine data — use of 
isolated primary or secondary care data may result in 
a biased selection of patients. This may affect patient 
recruitment as well as the validity and reliability of such 
information sources as secondary data in clinical trials, 
including clinical outcomes. It is similarly relevant to the 
public health environment. Using linked data allows one 
to have a more robust definition, by using pairs of GP 
and hospital codes only, but it is clear that the apparent 
incidence of a disease will be considerably lower. Alter- 
natively, linked data enable a looser but more inclusive 
disease definition, using both GP and hospital data, but 
not relying on the coding occurring simultaneously. 
When using separate data from only one source, one 
needs to take into account that patient characteristics
may not be representative of the wider population. It is 
difficult to recommend one coding approach over 
another, however, and the decision will need to be based 
on the specific question being posed. 

CONCLUSIONS 

In conclusion, patient characteristics vary depending on 
whether GP, hospital or combined definitions of 
cardiovascular events are used. In particular, disease 
severity as measured by mortality varies considerably. 
This has important implications for studies using linked 
routine primary and secondary care data, and for studies 
where information is only available from one of these 
sources. These issues should be acknowledged by studies 
using routine data as a secondary data source, and 
further work is merited to examine whether similar 
discrepancies exist for other clinical conditions or within 
primary care research databases. 